Project Part 3

Jack Provance

After collecting a large sample and performing initial centrality analysis in parts one and two, this section tailors the centrality scores and creates a clear graph to explain the conclusions.

Before addressing centrality in this network, a quick summary of the collected data is given below. This data set captures the "friend" (following) relationship in a network of Twitter users. The collection algorithm started at a given node, collected a sample of its following relationships, moved to one of the followed accounts, and then repeated the process. I used the Wall Street Journal (@WSJ), Financial Times (@ft), Bloomberg (@business), New York Times (@NYTimes), and BuzzFeed (@BuzzFeed) as starting nodes to compare the links between these networks. Part two of this project showed the shortest distances between the starting nodes and some basic centrality scores.

In [8]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib as mpl
import matplotlib.pyplot as plt
import json
from networkx.readwrite import json_graph
In [43]:
all_connections = pd.read_csv('AllConnections.csv', usecols=[1, 2])
all_connections.columns = ['User', 'Following']

# Combine the sampled users with any followed accounts that were never sampled.
all_users = np.hstack([
    all_connections['User'].unique(),
    all_connections[~all_connections['Following'].isin(
        all_connections['User'].unique())]['Following'].unique(),
])

G = nx.DiGraph()
G.add_nodes_from(all_users)
G.add_edges_from(list(all_connections.to_records(index=False)))

print('This network has {} unique Twitter users and {} connections between these users.'.format(G.number_of_nodes(), G.number_of_edges()))
This network has 16398 unique Twitter users and 20302 connections between these users.

There are many measures of centrality in graph theory that attempt to quantify different relationships. To choose the best measure for this application, the relevant calculations need to be compared. Because this is a sparse, directed graph with no weights, the measures used in this analysis are degree, eigenvector, betweenness, in-degree, out-degree, and Katz centrality.

To compare these scores visually, the graphs are redrawn with the color of the nodes changing based on their respective scores, with a darker color representing a higher centrality score.

In [2]:
nx.set_node_attributes(G, nx.degree_centrality(G), 'degree')
nx.set_node_attributes(G, nx.eigenvector_centrality(G), 'eigenvector')
nx.set_node_attributes(G, nx.betweenness_centrality(G), 'betweeness')
nx.set_node_attributes(G, nx.in_degree_centrality(G), 'in-degree')
nx.set_node_attributes(G, nx.out_degree_centrality(G), 'out-degree')
nx.set_node_attributes(G, nx.katz_centrality(G), 'katz')

with open('graph.json', 'w') as f:
    json.dump(json_graph.node_link_data(G), f)
In [53]:
with open('graph.json', 'r') as f:
    G = json_graph.node_link_graph(json.load(f))

deg = pd.DataFrame(G.degree())
deg.columns = ['node', 'degree']
G.remove_nodes_from(deg[deg['degree'] < 4]['node'])
pos = nx.spring_layout(G)

score = ['degree', 'betweeness', 'eigenvector', 'in-degree', 'out-degree', 'katz']
cmap = ['Purples', 'Greens', 'Blues', 'Oranges', 'Reds', 'Greys']
arr2d = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

fig, axs = plt.subplots(2, 3, figsize=[60, 40])

for n in range(0, 6):
    d = nx.get_node_attributes(G, score[n])
    low, *_, high = sorted(d.values())
    norm = mpl.colors.Normalize(vmin=low, vmax=high, clip=True)
    mapper = mpl.cm.ScalarMappable(norm=norm, cmap=cmap[n])
    nx.draw(G, pos, ax=axs[arr2d[n]], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3, edge_color='white')
    axs[arr2d[n]].set_title(score[n])

plt.savefig('central_comp_s.png')
plt.show()

The plots above are the same layout of nodes but colored by different centrality scores to emphasize the difference between the scores. The graph is also simplified to only show nodes with at least four edges. The layout of the graphs is built using the spring algorithm.

The 'degree' plot is the most intuitive because it is a normalized measure of the number of connections for each node. The 'in-degree' and 'out-degree' plots on the bottom row are interesting decompositions of the 'degree' plot. Because the following relationship is directed from one user to another, the 'out-degree' plot is a good representation of how the collection algorithm worked. Each node in the central group has a high out-degree score because these nodes were used as central nodes to collect the list of following relationships. The 'in-degree' plot shows that some, but not all, of the collection nodes are followed by others in the network. This plot also shows that some nodes in the outer ring have more followers than the inner group. This suggests that the group of nodes used to collect following relationships is not well connected internally.

The 'betweeness', 'eigenvector', and 'katz' plots contrast with the 'degree' plots. These centrality scores are designed to highlight "important" nodes in different ways. The eigenvector score rewards connections to well-connected nodes; the betweenness score ranks a node by the number of shortest paths between other nodes that pass through it; and the Katz score generalizes the basic degree score to count all reachable nodes, with a penalty that grows as the shortest path between two nodes lengthens. The 'betweeness' plot in green shows that only a handful of the central nodes from the 'out-degree' plot are valuable links between other nodes. Although the central nodes have many connections because of the collection process, these connections are not equally valuable. The 'eigenvector' and 'katz' plots support this observation from the 'betweeness' plot and highlight some nodes in the outer ring. The darker nodes on the outer ring are connected to other well-connected nodes.
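To make the contrast concrete, here is a minimal toy example (a made-up graph, not the collected data) showing that the scores rank nodes differently. Eigenvector centrality is omitted because it is not well defined on this small acyclic graph.

```python
import networkx as nx

# Hypothetical toy network (not the collected data): node 'b' bridges two hubs.
G = nx.DiGraph()
G.add_edges_from([
    ('a1', 'hub1'), ('a2', 'hub1'), ('a3', 'hub1'),  # hub1 has three followers
    ('hub1', 'b'), ('b', 'hub2'),                    # 'b' links the two halves
    ('hub2', 'c1'), ('hub2', 'c2'), ('hub2', 'c3'),  # hub2 follows three users
])

# Betweenness rewards the bridge: every shortest path between the two
# halves passes through 'b', even though its degree is only 2.
betweenness = nx.betweenness_centrality(G)
print(max(betweenness, key=betweenness.get))  # 'b'

# Katz accumulates attenuated incoming paths, so the hub with the most
# direct followers scores highest instead.
katz = nx.katz_centrality(G)
print(max(katz, key=katz.get))  # 'hub1'
```

The same degree-2 node tops one ranking and sits in the middle of the other, which is exactly the disagreement visible between the green and grey plots above.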

In terms of the movement of information, the 'eigenvector' and 'katz' plots show important voices in this network. The high scoring nodes on the outer ring have a wide reach despite a lower number of connections. The high scoring nodes in the interior are the valuable well-connected users. Distributing information from a news organization to the public requires a stable network. The 'betweeness' plot shows the users that are necessary to keep information moving smoothly. An organization can use these scores to distribute resources to high impact users and protect crucial connections.

An important consideration when presenting this network visually is the algorithm used to arrange the nodes. The comparison of centrality scores above used the spring_layout function from the NetworkX package. This implements the Fruchterman-Reingold force-directed algorithm to assign x-y positions to the nodes based on the edges. The edges can be thought of as springs holding the nodes together while the nodes repel other nodes. This layout keeps the highly connected nodes with high degree scores in the center. Other layout algorithms, compared below, include Kamada-Kawai, spectral, and shell.

In [49]:
with open('graph.json', 'r') as f:
    G = json_graph.node_link_graph(json.load(f))

deg = pd.DataFrame(G.degree())
deg.columns = ['node', 'degree']
G.remove_nodes_from(deg[deg['degree'] < 4]['node'])

d = nx.get_node_attributes(G, 'eigenvector')
low, *_, high = sorted(d.values())
norm = mpl.colors.Normalize(vmin=low, vmax=high, clip=True)
mapper = mpl.cm.ScalarMappable(norm=norm, cmap='Blues')

fig, axs = plt.subplots(2, 2, figsize=[40, 40])

# Spring, Kamada-Kawai, Spectral, Spiral, Shell, Planar
pos = nx.spring_layout(G)
nx.draw(G, pos, ax=axs[0, 0], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3)
axs[0, 0].set_title('Spring')

pos = nx.kamada_kawai_layout(G)
nx.draw(G, pos, ax=axs[0, 1], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3)
axs[0, 1].set_title('Kamada-Kawai')

pos = nx.spectral_layout(G)
nx.draw(G, pos, ax=axs[1, 0], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3)
axs[1, 0].set_title('Spectral')

pos = nx.shell_layout(G)
nx.draw(G, pos, ax=axs[1, 1], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3)
axs[1, 1].set_title('Shell')

plt.savefig('layout_comp.png')
plt.show()

The Kamada-Kawai layout algorithm is similar to the spring layout because it also minimizes the spring energy of the system. These two algorithms iterate, making incremental shifts until a minimum is reached. The spectral layout is based on the eigenvectors of the Laplacian matrix of the network, a matrix representation of the network's nodes and edges. The shell layout places the nodes on concentric circles. The layout above uses only one circle, but the shell layout is a great way to visualize nodes that are grouped in some way. Clustering the nodes based on other attributes and then creating a shell layout from the groups would be an interesting visual design.

For this specific data set, the force-directed layouts are the most intuitive way to visualize the network. A final visualization needs to incorporate both the centrality scores and a force-directed layout.

In [52]:
# RUN betweeness and eigenvector FOR spring and kamada-kawai
with open('graph.json', 'r') as f:
    G = json_graph.node_link_graph(json.load(f))

deg = pd.DataFrame(G.degree())
deg.columns = ['node', 'degree']
G.remove_nodes_from(deg[deg['degree'] < 4]['node'])
pos = nx.kamada_kawai_layout(G)

score = ['degree', 'betweeness', 'eigenvector', 'in-degree', 'out-degree', 'katz']
cmap = ['Purples', 'Greens', 'Blues', 'Oranges', 'Reds', 'Greys']
arr2d = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

fig, axs = plt.subplots(2, 3, figsize=[60, 40])

for n in range(0, 6):
    d = nx.get_node_attributes(G, score[n])
    low, *_, high = sorted(d.values())
    norm = mpl.colors.Normalize(vmin=low, vmax=high, clip=True)
    mapper = mpl.cm.ScalarMappable(norm=norm, cmap=cmap[n])
    nx.draw(G, pos, ax=axs[arr2d[n]], node_color=[mapper.to_rgba(i) for i in d.values()], alpha=0.3, edge_color='white')
    axs[arr2d[n]].set_title(score[n])

plt.savefig('central_comp_kk.png')
plt.show()

This set of plots compares the centrality scores using the Kamada-Kawai layout algorithm instead of the previous spring layout. An interesting difference between the two layouts is the 'out-degree' plot. With the spring layout these points are collected in the center, but with the Kamada-Kawai layout the points appear evenly distributed. The 'katz' plot looks much more like the 'in-degree' plot in this layout. This makes sense because the Katz centrality score generalizes the degree centrality score to include all connected nodes in the graph. The 'betweeness' and 'eigenvector' plots are still the better measures because those scores reward immediate connections to other well-connected nodes. To create a single, final graph that clarifies the network, the betweenness and eigenvector centrality scores will be used as node size and color, respectively. The Kamada-Kawai layout is preferred because the well-connected nodes are not grouped in the center but rather pulled toward the connected parts of the network.

In [60]:
# This block is commented out because it took hours to run. The output is also saved as a PNG file called 'final.png' in the repo
# with open('graph.json', 'r') as f:
#     G = json_graph.node_link_graph(json.load(f))

# # deg = pd.DataFrame(G.degree())
# # deg.columns = ['node', 'degree']
# # G.remove_nodes_from(deg[deg['degree'] < 4]['node'])
# pos = nx.kamada_kawai_layout(G)

# fig, ax = plt.subplots(figsize=[50, 50])

# d = nx.get_node_attributes(G, 'eigenvector')
# low, *_, high = sorted(d.values())
# norm = mpl.colors.Normalize(vmin=low, vmax=high, clip=True)
# mapper = mpl.cm.ScalarMappable(norm=norm, cmap='Blues')
# nx.draw_networkx_nodes(G, pos, ax=ax, node_color=[mapper.to_rgba(i) for i in d.values()], node_size=[i*1000000 + 300 for i in list(nx.get_node_attributes(G, 'betweeness').values())], alpha=0.5)
# nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.05, arrowstyle='wedge')

# plt.savefig('final.png')
# plt.show()

This is the entire network drawn with the Kamada-Kawai layout. The nodes are colored by their eigenvector centrality scores and scaled up from the default size by their betweenness centrality scores. When all the edges are drawn, the structure behind the centrality scores becomes clear: there is a group of nodes in the center that share connections to the central nodes.

In [65]:
with open('graph.json', 'r') as f:
    G = json_graph.node_link_graph(json.load(f))

deg = pd.DataFrame(G.degree())
deg.columns = ['node', 'degree']
G.remove_nodes_from(deg[deg['degree'] < 4]['node'])
pos = nx.kamada_kawai_layout(G)

fig, ax = plt.subplots(figsize=[50, 50])

d = nx.get_node_attributes(G, 'eigenvector')
low, *_, high = sorted(d.values())
norm = mpl.colors.Normalize(vmin=low, vmax=high, clip=True)
mapper = mpl.cm.ScalarMappable(norm=norm, cmap='Blues')
nx.draw_networkx_nodes(G, pos, ax=ax, node_color=[mapper.to_rgba(i) for i in d.values()], node_size=[i*5000000 + 300 for i in list(nx.get_node_attributes(G, 'betweeness').values())], alpha=0.5)
nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.05, arrowstyle='wedge')

plt.savefig('final_zoom.png')
plt.show()

This plot simplifies the network to just nodes with four or more connections. Because the underlying graph changed, the layout changed, but the pattern is still clear. The nodes around the outside of the graph are terminal, meaning information does not continue to move once it reaches these nodes. The larger nodes are central to moving information throughout the network. The darker blue nodes are important because they share connections with well-connected nodes.

The news organizations that were used as starting nodes are interested in receiving and distributing information quickly. Using this type of graph to measure users can help identify valuable connections. Building a stable network of information is crucial to reporting.

Further research in this project should move away from Python for data collection and storage. A graph database like Neo4j, which is designed to store networks of nodes and edges, is the correct way to collect and store this data at scale. There is already research into using a graph database to analyze Twitter connections, such as this report, this Medium post, and this blog post.

Python and NetworkX would still be useful for computing centrality scores, but the storage should be optimized. The notebook that performs the analysis should query only the necessary data from the database for performance.

Another extension is to add weights such as activity, like-to-comment ratio, average likes, standard deviation of likes, sentiment analysis, etc. Not all Twitter users are equal, and any measure of centrality should include measurements of use and influence. Adding a layer of clustering based on these attributes would also make the shell layout much more interesting. For example, users could be grouped by a set of metrics that relate to use, and then the most active users would sit in the center circle. Edges moving between the shells would help differentiate people with similar usage.
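A minimal sketch of the weighting idea, using made-up engagement values (none of this comes from the collected data): most NetworkX centrality functions accept a weight edge attribute.

```python
import networkx as nx

# Hypothetical engagement weights (illustrative values only): a weight could
# combine likes, comments, or sentiment for a following relationship.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ('u1', 'hub', 5.0),
    ('u2', 'hub', 1.0),
    ('u1', 'u3', 0.5),
    ('u3', 'hub', 2.0),
])

# Weighted in-degree sums incoming engagement instead of counting edges.
weighted_in = dict(G.in_degree(weight='weight'))
print(weighted_in['hub'])  # 8.0

# For betweenness, weights are treated as distances, so invert engagement
# first if stronger engagement should mean a "closer" connection.
distance = {e: 1.0 / w for e, w in nx.get_edge_attributes(G, 'weight').items()}
nx.set_edge_attributes(G, distance, 'distance')
betweenness = nx.betweenness_centrality(G, weight='distance')
```

The inversion step matters: passing raw engagement directly to a shortest-path-based score would treat the strongest connections as the longest ones.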

Referenced: https://stackoverflow.com/questions/13517614/draw-different-color-for-nodes-in-networkx-based-on-their-node-value?rq=1
